An unsupervised learning procedure based on maximizing the mutual information 
between the outputs of two networks receiving different but statistically dependent 
inputs is analyzed (Becker and Hinton, Nature 355, 161 (1992)). For a generic data 
model, I show that in the large sample limit the structure in the data is recognized by 
mutual information maximization. For a more restricted model, where the networks 
are similar to perceptrons, I calculate the learning curves for zero-temperature Gibbs 
learning. These show that convergence can be rather slow, and a way of regularizing 
the procedure is considered. 



PACS: 84.35.+i, 89.20.Ff, 64.60.Cn 



I. INTRODUCTION 

In unsupervised learning one often tries to find a mapping $\sigma$ of a high dimensional signal 
$X$ into a simple output space $Y$ which preserves the interesting and important features of the 
signal. The statement of the problem is rather vague, and a wealth of algorithms exist for the 
task which often define the meaning of "interesting and important" in terms of the algorithm 
itself [1]. In search of a principled approach, it seems natural to turn to information theory 
and to require that the mutual information $I(X; \sigma(X))$ between the signal $X$ and its encoding 
$\sigma(X)$ should be large. Unfortunately, this is often a trivial problem. If one component of $X$, 
say the first one, has a continuous distribution, the mutual information between $X$ and this 
component is infinite and so $I(X; \sigma(X))$ can be maximized by simply choosing $\sigma$ to project 
$X$ onto its first component. 

To arrive at a meaningful task one has thus considered maximizing $I(X; \sigma(X + \eta))$, where 
$\eta$ is isotropic Gaussian noise 0. Then if $\sigma$ is constrained to be linear and $X$ is Gaussian, 
the problem becomes equivalent to principal component analysis, but one can also consider 
nonlinear choices for $\sigma$. The drawback of this approach is that if one reparameterizes $X$, 
setting $\tilde{X} = \psi(X)$, then maximizing $I(\tilde{X}; \sigma(\tilde{X} + \eta))$ will in general yield quite different 
results even if $\psi$ is a simple linear and volume preserving mapping. So in this approach 
the meaning of interesting and important is implicitly defined by the choice of a coordinate 
system for $X$. 

It is much more natural to apply information theory when considering the related scenario 
that one has access to two signals $X_1$ and $X_2$ which are different but statistically dependent. 
For instance $X_1$ might be a visual and $X_2$ the corresponding auditory signal. Then $I(X_1; X_2)$ 
is a reparameterization invariant measure of the statistical dependence of the two signals and 
one can ask for a simple encoding of $X_1$ which preserves the mutual information of the two 
signals. So in this scenario one will look for a mapping $\sigma_1$ of $X_1$ into a simple output space 
$Y_1$ for which $I(\sigma_1(X_1); X_2)$ is large. This is the basic idea of the information bottleneck 
method 0,3. 

In the same setting, a more symmetric approach has been proposed by Becker and Hinton 
0|. The idea is to look for simple encodings $\sigma_1, \sigma_2$ of both signals which yield a large 
value of $I(\sigma_1(X_1); \sigma_2(X_2))$. An attractive feature of this approach is that to compute the 
mutual information of the encodings one has to estimate probabilities only in the simple 
output spaces $Y_1$ and $Y_2$ and not in the high dimensional space of the signals themselves. 

While the main thrust of this paper is to analyze Becker and Hinton's proposal using 
statistical physics, I shall first give some general characteristics of what can be learned 
by maximizing $I$ for a large class of scenarios where the approach seems suitable. I then 
specialize to the case that the $\sigma_i$ are perceptron-like architectures with discrete output 
values and set up a framework for analyzing learning from examples in the thermodynamic 
limit. Next, some learning curves obtained for specific cases are discussed, and I conclude 
by addressing the limitations of the presented approach and some insights gained from it. 



II. GENERAL CHARACTERISTICS 

In general terms the mutual information of $X_1$ and $X_2$ is the KL-divergence between the 
joint distribution of $X_1$ and $X_2$ and the product of their marginal distributions. 
If the variables have probability densities this definition reads: 

$$ I(X_1; X_2) = \int dz_1\, dz_2\; p(z_1, z_2) \log_2 \frac{p(z_1, z_2)}{p(z_1)\, p(z_2)} . \tag{1} $$

$I(X_1; X_2)$ is nonnegative and vanishes only if $X_1$ and $X_2$ are independent. So a positive value 
indicates statistical dependence, and the ideal scenario for Becker and Hinton's proposal is 
that this dependence is such that for suitable functions $\tau_1$ and $\tau_2$ we have $\tau_1(X_1) = \tau_2(X_2)$ for 
any possible joint occurrence of a pair $(X_1, X_2)$. For instance, $\tau_1(X_1)$ might be the common 
cause of the two signals. I shall further assume that the knowledge of $\tau_1(X_1)$ (or $\tau_2(X_2)$) 
encapsulates the entire statistical dependency of the two signals, so that the joint density of 
$(X_1, X_2)$ can be written as 

$$ p(x_1, x_2) = \frac{\delta_{\tau_1(x_1), \tau_2(x_2)}}{z_{\tau_1(x_1)}}\; p(x_1)\, p(x_2) . \tag{2} $$

For brevity I have assumed that the $\tau_i$ take on discrete values, so $\delta$ refers to Kronecker's 
delta and the normalization is 

$$ z_k = \mathrm{Prob}[\tau_1(X_1) = k] = \mathrm{Prob}[\tau_2(X_2) = k] . \tag{3} $$

If the joint distribution of the signals is given by (2), it makes sense to ask whether the $\tau_i$ 
can be inferred by observing only $(X_1, X_2)$. This naturally leads one to consider the mutual 
information because a simple calculation shows that $I(X_1; X_2) = I(\tau_1(X_1); \tau_2(X_2))$. In the 
appendix I show, using standard information theoretic relations, that any two mappings $\sigma_i$ 
which also preserve the mutual information, $I(X_1; X_2) = I(\sigma_1(X_1); \sigma_2(X_2))$, are related to 
the $\tau_i$ in a simple way. Namely, 

$$ \tau_i(X_i) = \phi_i(\sigma_i(X_i)) \tag{4} $$

holds identically for suitable mappings $\phi_i$, and in this sense the $\tau_i$ provide a simplest 
description of the data. If the $\sigma_i$ have the same number of output values as the $\tau_i$, the $\phi_i$ 
can only be permutations. Of course, as an unsupervised learning procedure, maximizing 
$I(\sigma_1(X_1); \sigma_2(X_2))$ does not fix specific values for the output labels. Despite this, I shall 
sometimes call the $\tau_i$ teachers and take such trivial permutational symmetries into account 
only tacitly. 
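
The "simple calculation" behind $I(X_1; X_2) = I(\tau_1(X_1); \tau_2(X_2))$ can be spelled out as follows. This is a sketch that uses only the data model (2) and the normalization (3); both sides reduce to the entropy of the shared label.

```latex
\begin{align*}
I(X_1;X_2)
  &= \int dx_1\,dx_2\; p(x_1,x_2)\,
     \log_2\frac{p(x_1,x_2)}{p(x_1)\,p(x_2)} \\
  &= \int dx_1\,dx_2\; p(x_1,x_2)\,
     \log_2\frac{1}{z_{\tau_1(x_1)}}
     && \text{(on the support of $p$ the Kronecker delta equals 1)} \\
  &= -\sum_k z_k \log_2 z_k
     && \text{(since $\mathrm{Prob}[\tau_1(X_1)=k] = z_k$)} , \\[4pt]
I(\tau_1(X_1);\tau_2(X_2))
  &= \sum_k z_k \log_2\frac{z_k}{z_k\,z_k}
   = -\sum_k z_k \log_2 z_k
     && \text{(joint label distribution $\delta_{k,k'}\,z_k$)} .
\end{align*}
```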

Realistically, one will not be able to choose the $\sigma_i$ based on the knowledge of the entire 
distribution of $(X_1, X_2)$, but will only have access to a training set $D$ of finitely many example 
pairs $(X_1^\mu, X_2^\mu)$ sampled independently from $(X_1, X_2)$. For a given $\sigma = (\sigma_1, \sigma_2)$, a pair of 
students, one will then compute the empirical frequencies 

$$ p_{u_1,u_2}(D, \sigma) = \frac{1}{m} \sum_{\mu=1}^{m} \prod_{i=1}^{2} \delta_{u_i, \sigma_i(X_i^\mu)} , \tag{5} $$

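where $m$ is the number of examples in $D$. Then the discrete version of (1) allows us to 
determine the empirical mutual information $I(D, \sigma)$ of the student pair on the training set 
by 

$$ I(D, \sigma) = \sum_{u_1,u_2=1}^{K} p_{u_1,u_2}(D, \sigma) \log_2 \frac{p_{u_1,u_2}(D, \sigma)}{p_{u_1,\cdot}(D, \sigma)\; p_{\cdot,u_2}(D, \sigma)} . \tag{6} $$

Here $K$ is the number of output classes and the explicit formula for the first marginal in (6) 
is $p_{u_1,\cdot}(D, \sigma) = \sum_{u_2=1}^{K} p_{u_1,u_2}(D, \sigma)$. 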
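
The empirical frequencies of Eq. (5) and the empirical mutual information built from them can be sketched in a few lines of Python. The function names and the toy label sequences are illustrative assumptions, not part of the original analysis; the labels stand for the outputs $\sigma_i(X_i^\mu)$ of a hypothetical student pair on $m = 8$ example pairs.

```python
import math
from collections import Counter

def empirical_frequencies(labels1, labels2):
    """Empirical joint frequencies p[(u1, u2)] from paired output labels, as in Eq. (5)."""
    m = len(labels1)
    counts = Counter(zip(labels1, labels2))
    return {pair: c / m for pair, c in counts.items()}

def empirical_mutual_information(labels1, labels2):
    """Empirical mutual information in bits (discrete version of Eq. (1))."""
    p = empirical_frequencies(labels1, labels2)
    p1, p2 = Counter(), Counter()          # the two marginals
    for (u1, u2), v in p.items():
        p1[u1] += v
        p2[u2] += v
    # Pairs with zero frequency never appear in p, so no log(0) can occur.
    return sum(v * math.log2(v / (p1[u1] * p2[u2])) for (u1, u2), v in p.items())

# Hypothetical student outputs (assumed data, for illustration only).
s1 = [0, 0, 1, 1, 0, 1, 0, 1]
s2 = [1, 1, 0, 0, 1, 0, 1, 0]   # s1 with the two labels swapped
s3 = [0, 1, 0, 1, 0, 0, 1, 1]   # empirically independent of s1

info_relabelled = empirical_mutual_information(s1, s2)
info_independent = empirical_mutual_information(s1, s3)
```

Here `info_relabelled` reaches the maximum of one bit although the two label sequences never agree, which illustrates that the procedure does not fix the output labels, while `info_independent` vanishes because every label pair occurs equally often.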

When learning, one has to restrict $\sigma_1$ and $\sigma_2$ to lie in a predefined set of functions and the 
obvious strategy is to choose a pair maximizing $I(D, \sigma)$. Of course, Eq. (4) will then only 
hold in the limit $m \to \infty$ of an infinite training set, and a key issue is to quantify the speed 
of this convergence. This seems especially important since the number of values taken on by 
the $\tau_i$ is in general not known. So it is quite possible that $K$ is chosen too large. Then, even 
in the infinite training set limit, there can be many different function pairs where $\sigma_i$ takes on 
all of the $K$ values, $I(\sigma_1(X_1); \sigma_2(X_2))$ is maximized, but $\phi_i(\sigma_i(X_i)) = \tau_i(X_i)$ can be satisfied by 
mappings $\phi_i$ which merge class labels. Thus one will not expect that the number of classes 
in the data is automatically inferred by mutual information maximization and will have to 
experiment with different values of $K$, considerably increasing the risk of over-fitting. 



III. STATISTICAL PHYSICS 

Let us now assume that the $\tau_i$ are perceptrons which yield output values in $0, \ldots, K-1$, 
and each $\tau_i$ is characterized by an $N$-dimensional weight vector $B_i$ of unit length and scalar 
biases $\kappa_i^{(k)}$, $k = 1, \ldots, K-1$. On an $N$-dimensional input $X_i$ the output of $\tau_i$ is 

$$ \tau_i(X_i) = \sum_{k=1}^{K-1} \theta\big(B_i^T X_i - \kappa_i^{(k)}\big) , \tag{7} $$

where $\theta$ is the 0, 1 step function. While Eq. (7) is invariant w.r.t. permutations of the biases, 
for brevity, I shall always assume that the bias terms are in ascending order ($\kappa_i^{(k)} < \kappa_i^{(k+1)}$). 
The marginal densities $p(x_1)$ and $p(x_2)$ which are used to define the joint density of the data 
(2) are assumed to have independent Gaussian input components with zero mean and unit 
variance. Then, to satisfy condition (3), the biases of $\tau_1$ and $\tau_2$ must be equal, 
$\kappa_1^{(k)} = \kappa_2^{(k)} = \kappa^{(k)}$. 

We assume that the general architecture of the teachers is known, and focus on pairs 
of students $\sigma_i$ performing a classification analogous to Eq. (7) but with weight vectors $J_i$ 
and biases $\lambda_i^{(k)}$. Note that while formally I assume that the number of biases is the same for 
teachers and students, this does not restrict generality. For instance, a scenario where the 
teachers have fewer output classes than the students is obtained by choosing some of the $\kappa^{(k)}$ 
to be equal. 

The performance of a student pair is then assessed using (6) to determine $I(D, \sigma)$. To 
investigate, in the thermodynamic limit, the typical properties of maximizing $I(D, \sigma)$, one 
has to fix a prior measure on the parameters of the students. For the weight vectors, we 
assume that the $J_i$ are drawn from the uniform density $dJ$ on the unit sphere. As there 
are only finitely many biases, the results for $N \to \infty$ do not depend on the prior density $d\lambda$ on 
the biases as long as the density vanishes nowhere. One could now consider the partition 
function 

$$ Z = \int dJ \int d\lambda \; e^{\beta N I(D, \sigma)} \tag{8} $$

for the Gibbs weight $e^{\beta N I(D, \sigma)}$ on the space of students. But a key technical difference 
to many other learning paradigms is that this Gibbs weight does not factorize over the 
examples. There are, however, some special cases, namely if there are just two output 
classes and no biases, where one can replace $I(D, \sigma)$ by an equivalent cost function which 
is just a sum over examples. Then maximizing $I(D, \sigma)$ is closely related to a supervised 
learning problem for parity machines. 
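
The classification rule used for teachers and students, a sum of step functions of the local field minus the biases, can be illustrated with a minimal sketch. It assumes the convention that the step function is 1 for strictly positive arguments; the function name and the toy vectors are hypothetical.

```python
def perceptron_output(weights, biases, x):
    """Class label in 0..K-1: the number of biases exceeded by the local field.

    Assumes theta(z) = 1 for z > 0 and 0 otherwise; weights is meant to be a
    unit-length vector, as for the teachers and students in the text.
    """
    field = sum(w * xi for w, xi in zip(weights, x))
    return sum(1 for b in biases if field > b)

# Toy example with N = 2 inputs and K = 3 output classes (assumed values).
weights = [1.0, 0.0]
biases = [-1.0, 1.0]
labels = [perceptron_output(weights, biases, x)
          for x in ([-3.0, 0.0], [0.3, 5.0], [2.0, 0.0])]

# Because the rule is a plain sum over the biases, their order does not matter.
labels_permuted = [perceptron_output(weights, [1.0, -1.0], x)
                   for x in ([-3.0, 0.0], [0.3, 5.0], [2.0, 0.0])]
```

Setting two biases equal in this sketch leaves one label unused, which is how a machine with fewer effective output classes is embedded in the $K$-class architecture.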

Here, I want to analyze more general scenarios and it is easier not to start with $e^{\beta N I(D, \sigma)}$ 
but to introduce target values $t_{u_1,u_2}$ for the empirical frequencies $p_{u_1,u_2}(D, \sigma)$ which determine 
$I(D, \sigma)$. We now consider the partition function 

$$ Z = \int dJ \int d\lambda \prod_{u_1,u_2} e^{-\frac{\beta N}{2} \left( t_{u_1,u_2} - p_{u_1,u_2}(D, \sigma) \right)^2} . \tag{9} $$

Analyzing the divergence of $\ln Z$ for $\beta \to \infty$ then tells us if the target values are feasible, 
i.e. whether student networks $\sigma_i$ exist with $t_{u_1,u_2} = p_{u_1,u_2}(D, \sigma)$. 

In the thermodynamic limit one will expect to find two regimes: As long as the number 
of training examples $m$ is small compared to $N$, it will be possible to find students which 
achieve the global maximum $\log_2 K$ of the mutual information. In terms of the target values 
this means that $t_{u_1,u_2} = K^{-1} \delta_{u_1,u_2}$ is feasible, and we need to study the partition function 
(9) for this choice of $t_{u_1,u_2}$. Once the ratio $\alpha = m/N$ becomes large enough, there will in 
general be no students $\sigma$ such that $I(D, \sigma) = \log_2 K$ and we need to determine the achievable 
empirical frequencies by finding feasible target values $t_{u_1,u_2}$ using Eq. (9). We can then 
search for the feasible target values which yield the maximal mutual information $I(\alpha)$. 

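For both regimes the starting point is to factorize (9) over the patterns, linearizing the 
exponent by an integral transform with Gaussians $L_{u_1,u_2}$ of zero mean and unit variance: 

$$ e^{-\frac{\beta N}{2} \left( t_{u_1,u_2} - p_{u_1,u_2}(D, \sigma) \right)^2} = \left\langle e^{\, i L_{u_1,u_2} \sqrt{\beta N} \left( t_{u_1,u_2} - p_{u_1,u_2}(D, \sigma) \right)} \right\rangle_{L_{u_1,u_2}} . \tag{10} $$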
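
The linearization rests on the Gaussian identity $\langle e^{iLa} \rangle_L = e^{-a^2/2}$ for a standard Gaussian $L$, applied with $a = \sqrt{\beta N}\,(t_{u_1,u_2} - p_{u_1,u_2})$. The following sketch checks the identity numerically with a plain trapezoidal rule; the function name, integration range and step count are assumptions chosen for illustration.

```python
import math

def gaussian_average_of_phase(a, half_width=12.0, steps=24000):
    """Trapezoidal estimate of <exp(i*L*a)> for a standard Gaussian L.

    The imaginary part vanishes by symmetry, so only cos(L*a) is integrated.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for n in range(steps + 1):
        x = -half_width + n * h
        weight = 0.5 if n in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * math.exp(-0.5 * x * x) * math.cos(a * x)
    return total * h / math.sqrt(2.0 * math.pi)

# Compare the numerical average with exp(-a^2/2) for a few values of a.
deviations = [abs(gaussian_average_of_phase(a) - math.exp(-0.5 * a * a))
              for a in (0.0, 0.7, 2.0)]
```

Since the exponent on the right-hand side of the transform is linear in the empirical frequencies, and these are sums over examples, the Gibbs weight factorizes over the patterns once the average over $L$ is pulled out.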

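One now employs standard arguments to calculate the quenched average in the thermodynamic 
limit and finds, within a replica symmetric parameterization, 

$$ \lim_{N \to \infty} N^{-1} \langle \ln Z \rangle_D = \max_{R, \lambda} \min_{q, L} \; \alpha G_0(L) + \alpha G_1(R, \lambda, q, L) + G_2(R, q) , $$

$$ G_0(L) = \sum_{u_1,u_2} \left( \frac{L_{u_1,u_2}^2}{2\beta} + L_{u_1,u_2}\, t_{u_1,u_2} \right) , $$

$$ G_2(R, q) = \frac{1}{2} \sum_i \left( \frac{q_i - R_i^2}{1 - q_i} + \ln(1 - q_i) \right) . \tag{11} $$

Here $R_i = J_i^T B_i$ is the typical overlap with the teacher of a student picked from the Gibbs 
distribution (9) and $q_i$ is the squared length of the thermal average of $J_i$. Further 

$$ G_1(R, \lambda, q, L) = \left\langle f_{\{R_i q_i^{-1/2}\}}(y_1, y_2) \, \ln \sum_{u_1,u_2} e^{-L_{u_1,u_2}} \prod_i H_{u_i}(\lambda_i, q_i, y_i) \right\rangle_{y_1,y_2} \tag{12} $$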
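where the $y_i$ are independent Gaussians with zero mean and unit variance. Further 

$$ f_{\{R_i q_i^{-1/2}\}}(y_1, y_2) = \sum_k \frac{1}{z_k} \prod_i H_k(\kappa, R_i q_i^{-1/2}, y_i) \tag{13} $$

with 

$$ H_{u_i}(\lambda_i, q_i, y_i) = H\!\left( \frac{\lambda_i^{(u_i)} - \sqrt{q_i}\, y_i}{\sqrt{1 - q_i}} \right) - H\!\left( \frac{\lambda_i^{(u_i+1)} - \sqrt{q_i}\, y_i}{\sqrt{1 - q_i}} \right) . \tag{14} $$

Here $H(z)$ is Gardner's $H$-function and to define Eq. (14) for $u_i = 0$ and $u_i = K-1$, we 
adopt the convention that $\lambda_i^{(0)} = -\infty$ and $\lambda_i^{(K)} = \infty$. The definition of $H_k(\kappa, R_i q_i^{-1/2}, y_i)$ is 
entirely analogous, also using $\kappa^{(0)} = -\infty$ and $\kappa^{(K)} = \infty$. 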
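
To make the role of the $H$-functions concrete, the sketch below evaluates Gardner's $H(z) = \int_z^\infty \frac{dx}{\sqrt{2\pi}} e^{-x^2/2}$ and the class probabilities it generates. It assumes the standard replica-symmetric picture in which the local field is $\sqrt{q}\,y + \sqrt{1-q}\,x$ with $x$ a standard Gaussian, so that each class probability is a difference of two $H$ values between neighbouring biases; the function names are hypothetical.

```python
import math

def gardner_h(z):
    """H(z): probability that a standard Gaussian variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def class_probabilities(biases, q, y):
    """Probability of each output 0..K-1 for the field sqrt(q)*y + sqrt(1-q)*x.

    Assumed form: H evaluated at the lower bias minus H at the upper bias,
    with the conventions that the 0-th bias is -inf and the K-th bias is +inf.
    """
    edges = [-math.inf] + sorted(biases) + [math.inf]
    scale = math.sqrt(1.0 - q)

    def tail(edge):
        return gardner_h((edge - math.sqrt(q) * y) / scale)

    return [tail(lo) - tail(hi) for lo, hi in zip(edges, edges[1:])]

# Example with K = 3 classes (assumed bias values, overlap q and field y).
probs = class_probabilities([-0.4, 0.9], q=0.6, y=0.3)
```

The infinite boundary biases make the first and last class probabilities one-sided tails, so the $K$ values always sum to one.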

Note that the physical interpretation of the auxiliary variables $L_{u_1,u_2}$ is that a student 
pair $\sigma$ picked from the Gibbs density will yield empirical frequencies $p_{u_1,u_2}(D, \sigma) = t_{u_1,u_2} + 
L_{u_1,u_2}/\beta$. Reasonably, one will only consider target values $t_{u_1,u_2}$ for these frequencies which 
sum to 1, and then the stationary values of $L_{u_1,u_2}$ must sum to 0. This can of course also 
be obtained by direct manipulation of Eq. (11). 

We are mainly interested in evaluating (11) for $\beta \to \infty$. The stationarity conditions for 
the order parameters yield that the scaling of a conjugate $L_{u_1,u_2}$ in this limit will depend 
on whether $t_{u_1,u_2}$ is positive or zero. Denoting by $S_t$ the support of $t$, i.e. the set of pairs 
$u = (u_1, u_2)$ for which $t_{u_1,u_2} > 0$, the stationarity conditions yield that $L_{u_1,u_2}$ diverges with 
$\beta$ as $\ln \beta$ if $u \notin S_t$. But for positive $t_{u_1,u_2}$, if $t$ is feasible, $L_{u_1,u_2}$ diverges as $-\ln \beta$, while for 
two pairs $u, u' \in S_t$, the difference between the conjugates 

$$ L_{u_1,u_2} - L_{u'_1,u'_2} = l_{u_1,u_2} - l_{u'_1,u'_2} \tag{15} $$

stays finite for large $\beta$. Thus one obtains for the limit $\beta \to \infty$ 

$$ \lim_{N \to \infty} N^{-1} \langle \ln Z \rangle_D = \max_{R, \lambda} \min_{q, l} \; \alpha \sum_{u \in S_t} l_{u_1,u_2}\, t_{u_1,u_2} + \alpha G_1(R, \lambda, q, l) + G_2(R, q) , $$

$$ G_1(R, \lambda, q, l) = \left\langle f_{\{R_i q_i^{-1/2}\}}(y_1, y_2) \, \ln \sum_{u \in S_t} e^{-l_{u_1,u_2}} \prod_i H_{u_i}(\lambda_i, q_i, y_i) \right\rangle_{y_1,y_2} . \tag{16} $$



When the mutual information is maximized by marginally feasible target values realized by 
only a single pair of students, we need to consider the limit $q_i \to 1$ in (16). As usual, the 
sum over $u$ in $G_1$ is dominated by its largest term in this limit. Setting 

$$ H^*_{u_i}(\lambda_i, y_i) = 2 \lim_{q_i \to 1} (1 - q_i) \ln H_{u_i}(\lambda_i, q_i, y_i) , $$

$$ \hat{u}(y_1, y_2) = \operatorname*{argmax}_{u \in S_t} \left\{ g_{u_1,u_2} + H^*_{u_1}(\lambda_1, y_1) + H^*_{u_2}(\lambda_2, y_2)/\gamma \right\} , \tag{17} $$

where, for $q_i \to 1$, $\gamma$ is the ratio $(1 - q_2)/(1 - q_1)$ and $g_{u_1,u_2} = l_{u_1,u_2}(1 - q_1)$, one obtains: 

t«i,« 2 = \f{Ri}{y^yz)b{u u u2),uB{ yi ,y 2 )) my2 

l-R- = IteW^CA*^)) • (is) 
FIG. 1: Learning curves for students with $K = 2$ output classes. The grey lines are for the random 
map problem, the thin black lines for a pair of teachers with two output classes and $\kappa^{(1)} = 1$. 

The interpretation of the above equations is that the target values $t_{u_1,u_2}$ are marginally 
feasible for some value of $\alpha$ if one can find $R_i$, $\lambda_i$, $g_{u_1,u_2}$ and $\gamma$ such that (18) holds for 
$u_i = 0, \ldots, K-1$ and $i = 1, 2$. 

Using the above results, the learning curves for maximizing $I(D, \sigma)$ in the large $N$ limit 
can be calculated. In the regime where $I(\alpha) = \log_2 K$ we use (16) with the target values 
$t_{u_1,u_2} = K^{-1} \delta_{u_1,u_2}$. But above a critical number of examples $I(\alpha)$ will be smaller than $\log_2 K$. 
Then using (18) to find the feasible targets $t_{u_1,u_2}$ which maximize the mutual information 
amounts to solving a constrained optimization problem. 

IV. LEARNING CURVES 

Before considering example scenarios, some words on numerically solving Eqs. (16) or 
(18) are in order. This turns out to be a nontrivial task since averages of functions have to 
be computed which are quite non-smooth once the $q_i$ are close to 1 in Eq. (16), and become 
discontinuous for Eq. (18). To achieve reliable numerical results, I have found it necessary 
to explicitly divide the two dimensional domain of integration into sub-regions where the 
integrand is both continuous and differentiable. The number of sub-regions one has to 
consider increases quite rapidly with $K$. 

Further, I have generally assumed site symmetry, $R_i = R$, $\lambda_i = \lambda$, $q_i = q$, although I did 
numerically check the local stability of the solution thus obtained for some points on the 
learning curves. 

The simplest case is that the students have $K = 2$ output classes and it is useful to first 
consider a degenerate scenario where the teachers have just a single output. So $I(X_1, X_2) = 0$ 
and the two signals are in fact independent. This is analogous to the random map problem 
in supervised learning, since nothing can be learned, and any pair of students will perform 
equally badly on the whole distribution of inputs. But for finite $\alpha$, up to $\alpha = 11.0$, one 
can find student pairs achieving the maximal value $I(D, \sigma) = 1$, as shown in Fig. 1. Above 
this critical value the maximal empirical mutual information $I(\alpha)$ starts to decay to zero, 
the feasible target matrix $t$ becomes non-diagonal but the value of the bias $\lambda^{(1)}$ is still zero. 
While above $\alpha = 11.0$ student pairs with a diagonal $t$ do exist, and have a nonzero $\lambda^{(1)}$, 
these pairs do not maximize $I(D, \sigma)$. 

FIG. 2: Learning curves obtained when the students and the pair of teachers have two output 
classes but $\kappa^{(1)} = 0.5$. 

The random map problem is relevant for learning since the students always have the 
option of ignoring the structure in the data. Formally, when $R = 0$ a learning problem 
with $I(X_1, X_2) > 0$ is equivalent to the $I(X_1, X_2) = 0$ case. This is illustrated (also Fig. 
1) by a scenario where the teachers have two output classes and $\kappa^{(1)} = 1$. This yields the 
moderate value $I(X_1, X_2) = 0.631$. But up to $\alpha = 22.3$ the structure present in the data is 
not recognized at all and we observe the same behavior as for random examples. At $\alpha = 22.3$ 
a first order phase transition occurs where $R$ and $\lambda^{(1)}$ jump from zero to values which are 
already close to 1. 
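
The quoted values of $I(X_1, X_2)$ can be checked directly. Under the data model (2) the mutual information of the two signals equals the entropy of the shared teacher label, and for unit-length weights and standard Gaussian inputs the teacher's local field is itself a standard Gaussian. The sketch below computes this entropy from the biases; the function names are illustrative assumptions.

```python
import math

def gaussian_tail(z):
    """Probability that a standard Gaussian variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def teacher_information(biases):
    """Entropy in bits of the teacher label for a standard Gaussian local field.

    Assumes the perceptron teachers of the text, so that the label
    probabilities z_k are Gaussian masses between neighbouring biases.
    """
    edges = [-math.inf] + sorted(biases) + [math.inf]
    probs = [gaussian_tail(lo) - gaussian_tail(hi)
             for lo, hi in zip(edges, edges[1:])]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

info_two_classes = teacher_information([1.0])            # about 0.631 bits
info_three_classes = teacher_information([-1.21, 1.21])  # close to 1 bit
```

A single bias at 1 gives roughly 0.631 bits, matching the value in the text, and symmetric biases at $\pm 1.21$ with three output classes give a value close to 1 bit.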

When choosing $\kappa^{(1)} = 0.5$, still for $K = 2$, a different behavior is observed since $I(X_1, X_2)$ 
is now quite close to 1. The phase where $I(\alpha) = 1$ is now a bit longer, extending up to 
$\alpha = 11.1$. But already in this phase the order parameters show a nontrivial behavior. 
The value of $R$ becomes positive above $\alpha = 3.0$ but is not monotonic in $\alpha$. So, while 
some structure is recognized in this phase due to entropic effects, the recognition is rather 
unreliable. This is also highlighted by the behavior of $\lambda^{(1)}$. While it is nonzero above $\alpha = 3.0$, 
it initially even has very small negative values (not visible in Fig. 2). Above $\alpha = 11.1$, when 
$I(\alpha) < 1$, robust convergence of the order parameters to their asymptotic values sets in. 

Turning to $K = 3$ (outputs 0, 1 or 2), we again first consider the case of random examples. 
For all values of $\alpha$ the bias term satisfies the symmetry $\lambda^{(2)} = -\lambda^{(1)}$. The phase where $I(\alpha)$ 
has the maximal possible value, which now equals $\log_2 3$, is shorter than for $K = 2$, extending 
till $\alpha = 6.96$ as shown in Fig. 3. Above $\alpha = 6.96$ the $t$-matrix is still diagonal initially. 
In this initial phase $\lambda^{(2)}$ decreases with $\alpha$; this narrows the gap between the output classes 0 
and 2, making it easier to find a student pair with $t_{02} = 0$. Remarkably, beyond $\alpha \approx 8$ one 
finds $\lambda^{(2)} = \lambda^{(1)} = 0$ but $t_{11} > 0$ as shown in Fig. 4. This verges on the paradoxical since 
by definition a student with $\lambda^{(1)} = \lambda^{(2)}$ will never produce the output label 1. However, we 
have taken the disorder average for $\lambda^{(1)} < \lambda^{(2)}$, so the observed result will naturally arise 
if the weight vectors of the optimal student pair satisfy $J_i^T X_i^\mu = 0$ on a subset of $D$. In 
addition, since we have taken the thermodynamic limit first, $\lambda^{(2)} = \lambda^{(1)}$ may only hold in the 
large $N$ limit and not for finite $N$. 

FIG. 3: Learning curves for students with $K = 3$ output classes. The grey lines are for the random 
map problem, the black lines for a pair of teachers with three output classes and $\kappa^{(2)} = -\kappa^{(1)} = 1.21$. 

FIG. 4: Feasible $t$ values for 3 output labels and random examples, $a = t_{00}$, $b = t_{02}$, $c = t_{11}$ as in 
Eq. (19). 

At $\alpha = 9.2$ a continuous phase transition occurs with the $t$-matrix becoming non-diagonal 
(Fig. 4). It then has the form 

$$ t = \begin{pmatrix} a & 0 & b \\ 0 & c & 0 \\ b & 0 & a \end{pmatrix} . \tag{19} $$
This is followed by a first order phase transition at $\alpha = 17.2$ with $\lambda^{(2)}$ jumping from 0 to 0.55. 
While the $t$-matrix keeps its shape (19), the values of $c$ and $a$ change drastically. The class 
of solutions the network is now exploring stays stable with increasing $\alpha$ and has a simple 
interpretation since the values of $a$ and $b$ converge. This means that from the point of view of 
mutual information there is no difference between outputs 0 and 2. In effect the three output 
classes architecture is emulating perceptrons which have just two output values but use the 
non-monotonic output function $\theta(\lambda^{(2)} - |J_i^T X_i|)$. While perhaps not quite as powerful as the 
reversed-wedge perceptron [8], this architecture will have a very high storage capacity, and 
this leads to a remarkably slow convergence of $I(\alpha)$ to its asymptotic value of 0. 
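
The emulation of a non-monotonic two-output unit by a three-output perceptron can be made explicit with a small sketch. It assumes symmetric biases $\mp\lambda$ and identifies the outer labels 0 and 2; the function names are hypothetical and ties at the biases, which have zero probability for continuous inputs, are ignored.

```python
def three_class_output(field, lam):
    """Output 0, 1 or 2 of a perceptron with symmetric biases -lam < lam."""
    return int(field > -lam) + int(field > lam)

def merged_output(field, lam):
    """Two-valued rule theta(lam - |field|): 1 inside the band, 0 outside."""
    return 1 if abs(field) < lam else 0

# Merging labels 0 and 2 of the three-class unit leaves the band indicator.
lam = 0.55
fields = [-2.0, -0.3, 0.0, 0.4, 1.7]
three = [three_class_output(h, lam) for h in fields]
merged = [merged_output(h, lam) for h in fields]
```

For every field value away from the biases, the three-class label equals 1 exactly when `merged_output` returns 1, so once the statistics of labels 0 and 2 coincide the machine carries the same information as the band-shaped two-output unit.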

The slow convergence for random examples suggests that it may be useful to regularize 
mutual information maximization and one way of doing this is considered in Fig. 3. The 
teachers have three output classes and biases $\kappa^{(2)} = -\kappa^{(1)} = 1.21$ yielding $I(X_1, X_2) = 1$. 
The students also have three output classes but the training is regularized by choosing 
students which maximize the mutual information under the constraint that the $t$-matrix be 
diagonal, so the outputs of the two students must be identical on the training set. The 
constraint becomes noticeable at $\alpha = 9.4$, where the achievable $I(\alpha)$ is now lower than 
for the unconstrained case with $R = 0$, i.e. the random problem discussed above. Due to 
the constraint there is a continuous phase transition to positive $R$ at this point. Next, at 
$\alpha = 10.9$, a first order phase transition to the asymptotic regime occurs, and the structure 
in the data is recognized well. At this point the biases become nonzero and satisfy the 
symmetry $\lambda^{(2)} = -\lambda^{(1)}$. Note that up to $\alpha = 43$ the achievable $I(\alpha)$ is smaller than for 
the unconstrained random map problem. So, regularizing the learning by constraining the 
student outputs to be equal is essential for the good generalization observed for $\alpha$ values in 
the range $[10.9, 43]$. 



V. CONCLUSION 

We have seen that mutual information maximization provides a principled approach to 
unsupervised learning. Interestingly, from a biological perspective, it emphasizes the role of 
multi-modal sensor fusion in perception. In contrast to many other unsupervised learning 
schemes such as principal component analysis, mutual information maximization can capture 
very complex statistical dependencies in the data, if the architecture chosen for the two 
networks is powerful enough. 

For the generic data model given by Eq. (2), I have shown that the structure in the data 
is recognized by mutual information maximization if the training set is sufficiently large, i.e. 
the procedure is consistent in a statistical sense. However, the detailed statistical physics 
calculations yield that many examples are needed to reach this asymptotic regime and that 
the learning process is complicated by many phase transitions. One reason for this is that 
a seemingly simple architecture such as a perceptron with three output classes can, from an 
information theoretic point of view, be equivalent to a perceptron which has just two output 
classes but uses a non-monotonic activation function. 

Of course, when considering the number of examples needed for reliable generalization, 
one has to keep in mind that examples are often much cheaper in unsupervised than in 
supervised learning. On the other hand, the detailed calculations have been for cases, where 
the students are just perceptrons and there are only few output classes. When increasing the 
number of output classes or when more powerful networks are used, one will expect an even 
slower convergence. So, in applications, it may be necessary to compromise the generality 
of Becker and Hinton's approach by using suitable regularizations. We have considered one 
way of doing this, namely constraining the two networks to give the same output on the 
examples in the training set. 

A major limitation of the above statistical physics analysis is that I have only considered 
the replica symmetric theory. It is, however, evident that in many of the above scenarios 
replica symmetry will be broken. A case in point is the random map problem for two output 
classes where maximizing the mutual information yields a critical value $\alpha = 11.0$ up to 
which $I(\alpha) = 1$. This value is equal to the storage capacity of the tree parity machine with 
two hidden units 0], as one would expect, by the equivalence of the two problems in the 
unbiased case. But one step of replica symmetry breaking, considered in [9] for the tree 
parity machine, shows that the critical capacity is in fact some 25% smaller. 

To write down the one step symmetry breaking equations for mutual information 
maximization is a straightforward task. But given the numerical difficulties already encountered 
in solving the replica symmetric equations, the numerics of one step of replica symmetry 
breaking are daunting. While one will expect that some of the quantitative findings described 
above change when replica symmetry breaking is taken into account, one can reasonably 
assume that more qualitative aspects such as the nature of the phase transitions are described 
correctly by the present theory. 

It is a pleasure to acknowledge many stimulating discussions with Georg Reents and 
Manfred Opper. This work was supported by the Deutsche Forschungsgemeinschaft. 
APPENDIX 

Our goal is to show that if the joint density of $X_1$ and $X_2$ satisfies (2), then $I(X_1; X_2) = 
I(\sigma_1(X_1); \sigma_2(X_2))$ implies Eq. (4). We shall need two facts from information theory, see e.g. 
[10]. The first is the data processing inequality (DPI), which states that for any mapping $\sigma$ 

$$ I(X_1; X_2) \ge I(X_1; \sigma(X_2)) , \tag{20} $$

that is, processing cannot increase information. The second is the chain rule which allows one to 
decompose the mutual information of a random variable $X_1$ with a pair of random variables 
$(X_2, X_3)$ via: 

$$ I(X_1; X_2, X_3) = I(X_1; X_3) + I(X_1; X_2 \mid X_3) , \tag{21} $$

where the last term denotes the mutual information of the conditional distribution of 
$(X_1, X_2)$ given a value of $X_3$, averaged over $X_3$. 
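
Both facts are easy to verify numerically on a small discrete example. The sketch below uses a hypothetical joint table for $(X_1, X_2)$ and takes $X_3$ to be a deterministic function of $X_2$ that merges two of its values; the table, the merging map and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def marginal(joint, keep):
    """Marginalize a table {outcome tuple: prob} onto the coordinates in keep."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def mutual_information(joint, a, b):
    """I(A;B) in bits, where a and b are tuples of coordinate indices."""
    pab, pa, pb = marginal(joint, a + b), marginal(joint, a), marginal(joint, b)
    total = 0.0
    for outcome, p in pab.items():
        if p > 0.0:
            total += p * math.log2(p / (pa[outcome[:len(a)]] * pb[outcome[len(a):]]))
    return total

def conditional_mutual_information(joint, a, b, c):
    """I(A;B|C): mutual information of the conditional tables, averaged over C."""
    total = 0.0
    for c_val, pc in marginal(joint, c).items():
        cond = {o: p / pc for o, p in joint.items()
                if tuple(o[i] for i in c) == c_val}
        total += pc * mutual_information(cond, a, b)
    return total

def merge(x2):
    """Hypothetical processing step: values 1 and 2 of X2 are merged."""
    return min(x2, 1)

# Hypothetical joint distribution of (X1, X2); X3 = merge(X2).
pairs = {(0, 0): 0.30, (0, 1): 0.10, (1, 1): 0.20, (1, 2): 0.15, (0, 2): 0.25}
joint = {(x1, x2, merge(x2)): p for (x1, x2), p in pairs.items()}

info_full = mutual_information(joint, (0,), (1,))       # I(X1; X2)
info_processed = mutual_information(joint, (0,), (2,))  # I(X1; X3)
chain_lhs = mutual_information(joint, (0,), (1, 2))     # I(X1; X2, X3)
chain_rhs = info_processed + conditional_mutual_information(joint, (0,), (1,), (2,))
```

In this example `info_processed` cannot exceed `info_full`, and `chain_lhs` agrees with `chain_rhs` up to rounding. Because $X_3$ is a function of $X_2$, `chain_lhs` also coincides with `info_full`, which is the step used in the argument that follows.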
Now, assuming Eq. (2), and 

$$ I(X_1; X_2) = I(\sigma_1(X_1); \sigma_2(X_2)) \tag{22} $$

we have 

$$ I(X_1; X_2) = I(X_1; \tau_2(X_2), \sigma_2(X_2)) $$

$$ \phantom{I(X_1; X_2)} = I(X_1; \sigma_2(X_2)) + I(X_1; \tau_2(X_2) \mid \sigma_2(X_2)) $$

$$ \phantom{I(X_1; X_2)} = I(X_1; X_2) + I(X_1; \tau_2(X_2) \mid \sigma_2(X_2)) . \tag{23} $$

Here the first equality is a consequence of the DPI and (22), the second is the chain rule, 
and the third is again DPI and (22). 

So $I(X_1; \tau_2(X_2) \mid \sigma_2(X_2)) = 0$ and this means that $X_1$ and $\tau_2(X_2)$ are conditionally 
independent given $\sigma_2(X_2)$. In other words: 

$$ p(X_1, \tau_2(X_2) \mid \sigma_2(X_2)) = p(X_1 \mid \sigma_2(X_2)) \; p(\tau_2(X_2) \mid \sigma_2(X_2)) \tag{24} $$

or 

$$ p(X_1, \tau_2(X_2), \sigma_2(X_2)) = p(X_1, \sigma_2(X_2)) \; p(\tau_2(X_2) \mid \sigma_2(X_2)) . \tag{25} $$

But from the definition of the joint density (2) we see that $p(X_1, \tau_2(X_2), \sigma_2(X_2))$ can only 
be nonzero if $\tau_1(X_1) = \tau_2(X_2)$ and in this case equals $p(X_1, \sigma_2(X_2))$. So $p(\tau_2(X_2) \mid \sigma_2(X_2))$ 
is either zero or one and this means that $\tau_2(X_2)$ is a function of $\sigma_2(X_2)$. By symmetry, this 
is also true of $\tau_1(X_1)$ and $\sigma_1(X_1)$. 